Abstract

Most of the existing classification methods are aimed at minimization
of empirical risk (through some simple point-based error measured with
a loss function) with added regularization. We propose to approach this
problem in a more information theoretic way by investigating the applicability
of entropy measures as a classification model objective function. We focus
on Rényi's quadratic entropy and the connected Cauchy-Schwarz Divergence,
which leads to the construction of Extreme Entropy Machines (EEM).

The main contribution of this paper is a model based on
information theoretic concepts which on the one hand shows a new,
entropic perspective on known linear classifiers and on the other leads to
the construction of a very robust method, competitive with the state of the
art non-information theoretic ones (including Support Vector Machines
and Extreme Learning Machines).

Evaluation on numerous problems, spanning from small, simple ones
from the UCI repository to large (hundreds of thousands of samples),
extremely unbalanced (up to 100:1 class ratios) datasets, shows wide
applicability of the EEM in real life problems and that it scales well.


1 Introduction 

There is no single, universal, perfect optimization criterion that can be used to
train a machine learning model. Even for linear classifiers one can find multiple
objective functions, error measures to minimize and regularization methods to
include [14]. Most of the existing methods are aimed at minimization of empirical
risk (through some simple point-based error measured with a loss function)
with added regularization. We propose to approach this problem in a more
information theoretic way by investigating the applicability of entropy measures as a
classification model objective function. We focus on Rényi's quadratic entropy
and the connected Cauchy-Schwarz Divergence.

One of the information theoretic concepts which has been found very effective
in machine learning is the entropy measure. In particular the rule of
maximum entropy modeling led to the construction of the MaxEnt model and its
structural generalization - Conditional Random Fields, which are considered
state of the art in many applications. In this paper we propose to use Rényi's
quadratic cross entropy as the measure of divergence between two density estimations
in order to find the best linear classifier. It is a conceptually different approach
than typical entropy models as it works in the input space instead of the decisions
distribution. As a result we obtain a model closely related to Fisher's
Discriminant (or more generally Linear Discriminant Analysis), which deepens
the understanding of this classical approach. Together with a powerful extreme
data transformation we obtain a robust, nonlinear model competitive with the
state of the art models not based on information theory, like Support Vector
Machines (SVM), Extreme Learning Machines (ELM) or Least Squares
Support Vector Machines (LS-SVM). We also show that under some simplifying
assumptions ELM and LS-SVM can be seen through the perspective of
information theory, as their solutions are (up to some constants) identical to the
ones obtained by the proposed method.

The paper is structured as follows: first we recall some preliminary information
regarding ELMs and Support Vector Machines, including Least Squares Support
Vector Machines. Next we introduce our Extreme Entropy Machine (EEM)
together with its kernelized extreme counterpart - Extreme Entropy Kernel
Machine (EEKM). We show some connections with existing models and some
different perspectives for looking at the proposed model. In particular, we show
how the learning capabilities of EEM (and EEKM) resemble those of ELM (and
LS-SVM respectively). During evaluation on over 20 binary datasets (of various
sizes and characteristics) we analyze generalization capabilities and learning
speed. We show that it can be a valuable, robust alternative to existing methods.
In particular, we show that it achieves stability analogous to that of ELM in terms
of the hidden layer size. We conclude with future development plans and open
problems.


2 Preliminaries 

Let us begin by recalling some basic information regarding Extreme Learning
Machines and Support Vector Machines, which are further used as
competing models for the proposed solution. We focus here on the optimization
problems being solved, to underline some basic differences between these methods
and EEMs.

2.1 Extreme Learning Machines 

Extreme Learning Machines are relatively young models introduced by Huang et
al. which are based on the idea that single layer feed forward neural networks
(SLFN) can be trained without an iterative process by performing linear regression
on the data mapped through a random, nonlinear projection (random hidden
neurons). More precisely speaking, the basic ELM architecture consists of d input
neurons connected with each input space dimension which are fully connected
with h hidden neurons by the set of weights $w_j$ (selected randomly from some
arbitrary distribution) and the set of biases $b_j$ (also randomly selected). Given some
generalized nonlinear activation function G one can express the hidden neurons
activation matrix H for the whole training set $X, T = \{(x_i, t_i)\}_{i=1}^N$ such that
$x_i \in \mathbb{R}^d$ and $t_i \in \{-1, +1\}$ as

$$H_{ij} = G(x_i, w_j, b_j), \quad \text{where } i = 1, \ldots, N,\ j = 1, \ldots, h.$$

If we denote the weights between the hidden layer and output neurons by $\beta$
we can formulate the following optimization problem

Optimization problem: Extreme Learning Machine

$$\underset{\beta}{\text{minimize}} \quad \|H\beta - T\|^2$$

It is easy to show [11] that putting

$$\beta = H^\dagger T,$$

gives the best solution in terms of mean squared error of the regression:

$$\|H\beta - T\|^2 = \|H(H^\dagger T) - T\|^2 = \min_{\alpha \in \mathbb{R}^h} \|H\alpha - T\|^2,$$

where $H^\dagger$ denotes the Moore-Penrose pseudoinverse of the matrix H.

Final classification of a new point x can now be performed analogously by
classifying according to

$$cl(x) = \mathrm{sign}\left([G(x, w_1, b_1) \ \ldots \ G(x, w_h, b_h)]\,\beta\right).$$

As it is based on ordinary least squares optimization, it is possible to
balance it for unbalanced datasets by performing weighted ordinary
least squares. In such a scenario, given a vector B such that $B_i$ is the square root
of the inverse of the size of $x_i$'s class and $B \cdot X$ denotes element wise multiplication
between B and X:

$$\beta = (B \cdot H)^\dagger (B \cdot T).$$
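The ELM training described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the reference implementation: the sigmoid activation, the Gaussian distribution of the random weights, and the names `elm_fit` and `elm_predict` are assumptions made for the example.

```python
import numpy as np

def _hidden(X, W, b):
    # H_ij = G(x_i, w_j, b_j) with a sigmoid G (an assumed choice of activation)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_fit(X, t, h, rng, weighted=False):
    """Random hidden layer followed by (optionally weighted) least squares."""
    d = X.shape[1]
    W = rng.normal(size=(d, h))  # random input weights w_j
    b = rng.normal(size=h)       # random biases b_j
    H = _hidden(X, W, b)
    if weighted:
        # B_i = square root of the inverse of the size of x_i's class
        n_pos, n_neg = np.sum(t == 1), np.sum(t == -1)
        B = np.where(t == 1, 1.0 / np.sqrt(n_pos), 1.0 / np.sqrt(n_neg))
        beta = np.linalg.pinv(B[:, None] * H) @ (B * t)
    else:
        beta = np.linalg.pinv(H) @ t  # beta = H^dagger T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.sign(_hidden(X, W, b) @ beta)
```

The only trained parameters are the output weights `beta`; `W` and `b` stay at their random initial values.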

2.2 Support Vector Machines and Least Squares Support 
Vector Machines 

One of the most well known classifiers of the last decade is Vapnik's Support
Vector Machine (SVM), based on the principle of creating a linear classifier
that maximizes the separating margin between elements of two classes.

Optimization problem: Support Vector Machine

$$\underset{\beta, b, \xi}{\text{minimize}} \quad \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^N \xi_i$$
$$\text{subject to} \quad t_i(\langle \beta, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N$$

which can be further kernelized (delinearized) using any kernel K (valid in the
Mercer's sense):

Optimization problem: Kernel Support Vector Machine

$$\underset{\beta}{\text{maximize}} \quad \sum_{i=1}^N \beta_i - \frac{1}{2} \sum_{i,j=1}^N \beta_i \beta_j t_i t_j K(x_i, x_j)$$
$$\text{subject to} \quad \sum_{i=1}^N \beta_i t_i = 0, \quad 0 \leq \beta_i \leq C, \quad i = 1, \ldots, N$$

The problem is a quadratic optimization with linear constraints, which can
be efficiently solved using quadratic programming techniques. Due to the use of
the hinge loss function, SVM attains very sparse solutions in terms of nonzero $\beta_i$.
As a result, the classifier does not have to remember the whole training set, but
instead, the set of so called support vectors ($SV = \{x_i : \beta_i > 0\}$), and classifies
a new point according to

$$cl(x) = \mathrm{sign}\left(\sum_{x_i \in SV} \beta_i t_i K(x_i, x) + b\right).$$

It appears that if we change the loss function to the quadratic one we can
greatly reduce the complexity of the resulting optimization problem, leading to
the so called Least Squares Support Vector Machines (LS-SVM).

Optimization problem: Least Squares Support Vector Machine

$$\underset{\beta, b, \xi}{\text{minimize}} \quad \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^N \xi_i^2$$
$$\text{subject to} \quad t_i(\langle \beta, x_i \rangle + b) = 1 - \xi_i, \quad i = 1, \ldots, N$$

and the decision is made according to

$$cl(x) = \mathrm{sign}(\langle \beta, x \rangle + b).$$

As Suykens et al. showed, this can be further generalized to arbitrary kernel
induced spaces, where we classify according to:

$$cl(x) = \mathrm{sign}\left(\sum_{i=1}^N \beta_i K(x_i, x) + b\right)$$

where $\beta_i$ are Lagrange multipliers associated with particular training examples
$x_i$ and b is a threshold, found by solving the linear system


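$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K(X,X) + I/C \end{bmatrix} \begin{bmatrix} b \\ \beta \end{bmatrix} = \begin{bmatrix} 0 \\ T \end{bmatrix}$$

where $\mathbf{1}$ is a vector of ones and I is an identity matrix of appropriate dimensions.
Thus the training procedure becomes

$$\begin{bmatrix} b \\ \beta \end{bmatrix} = \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K(X,X) + I/C \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ T \end{bmatrix}$$

Similarly to the classical SVM, this formulation is highly unbalanced (its results
are skewed towards bigger classes). To overcome this issue one can introduce a
weighted version [20], where we are given a diagonal matrix of weights Q, such that $Q_{ii}$
is inversely proportional to the size of $x_i$'s class:

$$\begin{bmatrix} b \\ \beta \end{bmatrix} = \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K(X,X) + Q/C \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ T \end{bmatrix}$$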


Unfortunately, due to the introduction of the square loss, the sparseness of the
Support Vector Machine solution is completely lost. The resulting training has a
closed form solution, but requires the computation of the whole Gram matrix
and the resulting machine has to remember¹ the whole training set in order to
classify a new point.
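As an illustration of the closed form LS-SVM training, the linear system can be assembled and solved directly. The sketch below is a minimal example under assumptions: an RBF kernel with a hypothetical `gamma` parameter, and the function names `rbf_kernel`, `lssvm_fit` and `lssvm_predict` are chosen here for illustration only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), an assumed choice of Mercer kernel
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def lssvm_fit(X, t, C=10.0, gamma=1.0, weights=None):
    """Solve [[0, 1^T], [1, K + Q/C]] [b; beta] = [0; T]; Q = I when unweighted."""
    N = X.shape[0]
    Q = np.eye(N) if weights is None else np.diag(weights)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, gamma) + Q / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], t)))
    return sol[0], sol[1:]  # threshold b, multipliers beta

def lssvm_predict(X_new, X, b, beta, gamma=1.0):
    # every training point contributes, so the whole training set must be kept
    return np.sign(rbf_kernel(X_new, X, gamma) @ beta + b)
```

Note that `lssvm_predict` needs the full training matrix `X`, which is exactly the loss of sparseness discussed above.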


3 Extreme Entropy Machines 


Let us first recall the formulation of the linear classification problem in
highly dimensional feature spaces, i.e. when the number of samples N is equal to (or
less than) the dimension of the feature space h. In particular we formulate the
problem in the limiting case² when $h = \infty$:

Problem 1. We are given a finite number of (often linearly independent) points
$H^\pm$ in an infinite dimensional Hilbert space $\mathcal{H}$. Points $H^+ \subset \mathcal{H}$ constitute the
positive class, while $H^- \subset \mathcal{H}$ the negative class.

We search for $\beta \in \mathcal{H}$ such that the sets $\beta^T H^+$ and $\beta^T H^-$ are optimally
separated.

Observe that in itself (without additional regularization) the problem is not
well-posed as, by applying the linear independence of the data, for arbitrary
$m_+ \neq m_-$ in $\mathbb{R}$ we can easily construct $\beta \in \mathcal{H}$ such that

$$\beta^T H^+ = \{m_+\} \ \text{and} \ \beta^T H^- = \{m_-\}. \quad (1)$$

However, this leads to a model case of overfitting, which typically yields suboptimal
results on the testing set (different from the original training samples).

To make the problem well-posed, we typically need to: 

1. add/allow some error in the data. 


2. specify some objective function including a term penalising the model's
complexity.


Popular choices of the objective function include per-point classification loss
(like square loss in neural networks or hinge loss in SVM) with a regularization
term added, often expressed as the square of the norm of our operator $\beta$ (like
in SVM or in weight decay regularization of neural networks). In general one
can divide derivations of objective functions into the following categories:

• regression based (like in neural networks or ELM), 

• probabilistic (like in the case of Naive Bayes), 

• geometric (like in SVM), 

• information theoretic (entropy models). 


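We focus on the last group of approaches, and investigate the applicability of
the Cauchy-Schwarz divergence [12], which for two densities f and g is given by

$$D_{CS}(f, g) = \ln\left(\int f^2\right) + \ln\left(\int g^2\right) - 2\ln\left(\int fg\right) = -2\ln\left(\int \frac{f}{\|f\|_2} \frac{g}{\|g\|_2}\right).$$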

¹ There are some pruning techniques for LS-SVM but we are not investigating them here.
² Which is often obtained by the kernel approach.
Cauchy-Schwarz divergence is connected to Rényi's quadratic cross entropy
($H_2^\times$) and Rényi's quadratic entropy ($H_2$), defined for densities f, g as

$$H_2^\times(f, g) = -\ln\left(\int fg\right), \qquad H_2(f) = H_2^\times(f, f) = -\ln\left(\int f^2\right),$$

consequently

$$D_{CS}(f, g) = 2H_2^\times(f, g) - H_2(f) - H_2(g),$$

and, as we showed previously, it is well-suited as a discrimination measure which
allows the construction of multi-threshold linear classifiers. In general an increase
of the value of Cauchy-Schwarz Divergence results in better discrimination of the sets
(densities). Unfortunately, there are a few problems with such an approach:

• true datasets are discrete, so we do not have densities f and g,

• statistical density estimators require rather large sample sizes and are very 
computationally expensive. 


There are basically two approaches which help us recover underlying densities
from the samples. The first one is performing some kind of density estimation,
like the well known Kernel Density Estimation (KDE) technique, which is based
on the observation that any arbitrary continuous distribution can be sufficiently
approximated by a convex combination of Gaussians. The other approach is
to assume some density model (distribution family) and fit its parameters in
order to maximize the data generation probability. In statistics it is known as
the maximum likelihood estimation (MLE) approach. MLE has the advantage that
in general it produces much simpler density descriptions than KDE, as the latter's
description grows linearly with the sample size.

A common choice of density models are Gaussian distributions due to their
nice theoretical and practical (computational) capabilities. As mentioned earlier,
a convex combination of Gaussians can approximate a given continuous
distribution f with arbitrary precision. In order to fit a Gaussian Mixture Model
(GMM) to a given dataset one needs an algorithm like Expectation Maximization
or the conceptually similar Cross-Entropy Clustering [22]. However, for simplicity
and strong regularization we propose to model f as one big Gaussian $\mathcal{N}(m, \Sigma)$.
One of the biggest advantages of such an approach is the closed form MLE parameter
estimation, as we simply put m equal to the empirical mean of the data,
and $\Sigma$ as some data covariance estimator. Secondly, this way we introduce an
error to the data which has an important regularizing role and leads to a better
posed optimization problem.

Let us now recall that the Shannon's differential entropy (expressed in nits)
of the continuous distribution f is

$$H(f) = -\int f \ln f.$$

We will now show that the choice of Normal distributions is not arbitrary but supported
by the assumptions of entropy maximization. The following result is
known, but we include the whole reasoning for completeness.
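Remark 1. Normal distribution $\mathcal{N}(m, \Sigma)$ has a maximum Shannon's differential
entropy among all real-valued distributions with mean $m \in \mathbb{R}^h$ and covariance
$\Sigma \in \mathbb{R}^{h \times h}$.

Proof. Let f and g be arbitrary distributions with covariance $\Sigma$ and means m.
For simplicity we assume that m = 0 but the analogous proof holds for arbitrary
mean, then

$$\int f\, h_i h_j \, dh_i \, dh_j = \int g\, h_i h_j \, dh_i \, dh_j = \Sigma_{ij},$$

so for a quadratic form A

$$\int A f = \int A g.$$

Notice that

$$\ln \mathcal{N}(0, \Sigma)[h] = \ln\left(\frac{1}{\sqrt{(2\pi)^h \det(\Sigma)}} \exp\left(-\tfrac{1}{2} h^T \Sigma^{-1} h\right)\right) = -\tfrac{1}{2}\ln\left((2\pi)^h \det(\Sigma)\right) - \tfrac{1}{2} h^T \Sigma^{-1} h$$

is a quadratic form plus constant thus

$$\int f \ln \mathcal{N}(0, \Sigma) = \int \mathcal{N}(0, \Sigma) \ln \mathcal{N}(0, \Sigma),$$

which together with non-negativity of Kullback-Leibler Divergence gives

$$0 \leq D_{KL}(f \,\|\, \mathcal{N}(0, \Sigma)) = \int f \ln\frac{f}{\mathcal{N}(0, \Sigma)} = \int f \ln f - \int f \ln \mathcal{N}(0, \Sigma)$$
$$= \int f \ln f - \int \mathcal{N}(0, \Sigma) \ln \mathcal{N}(0, \Sigma) = -H(f) + H(\mathcal{N}(0, \Sigma)),$$

thus

$$H(\mathcal{N}(0, \Sigma)) \geq H(f).$$

□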
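A quick one-dimensional sanity check of this maximum entropy property compares well-known closed form entropies (in nits) of three distributions with the same variance. The choice of the uniform and Laplace distributions as competitors, and the function names, are assumptions made for this illustration.

```python
import math

def gaussian_entropy(var):
    # H(N(m, var)) = 0.5 * ln(2 * pi * e * var)
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def uniform_entropy(var):
    # a uniform density on an interval of width w has variance w^2 / 12 and entropy ln(w)
    return math.log(math.sqrt(12.0 * var))

def laplace_entropy(var):
    # a Laplace density with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b)
    b = math.sqrt(var / 2.0)
    return 1.0 + math.log(2.0 * b)

for var in (0.1, 1.0, 25.0):
    print(var, gaussian_entropy(var), uniform_entropy(var), laplace_entropy(var))
```

For every variance the Gaussian value is the largest of the three, in agreement with Remark 1.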


There appears a nontrivial question of how to find/estimate the desired Gaussian,
as the covariance can be singular. In this case, to regularize the covariance we
apply the well-known Ledoit-Wolf approach [15]:

$$\Sigma^\pm = \mathrm{cov}_\dagger(H^\pm) = (1 - \varepsilon^\pm)\,\mathrm{cov}(H^\pm) + \varepsilon^\pm \,\mathrm{tr}(\mathrm{cov}(H^\pm))\, h^{-1} I,$$

where $\mathrm{cov}(\cdot)$ is an empirical covariance and $\varepsilon^\pm$ is a shrinkage coefficient given
by Ledoit and Wolf [15].
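The shrinkage formula itself is straightforward to express in NumPy. In the sketch below the shrinkage coefficient `eps` is simply passed in by the caller, which is an assumption of this example: the Ledoit-Wolf procedure estimates it from the data (for instance, scikit-learn's `sklearn.covariance.LedoitWolf` estimator does so). The function name `shrunk_covariance` is also chosen here for illustration.

```python
import numpy as np

def shrunk_covariance(H, eps):
    """(1 - eps) * cov(H) + eps * tr(cov(H)) / h * I, with samples in the rows of H."""
    h = H.shape[1]
    S = np.atleast_2d(np.cov(H, rowvar=False, bias=True))  # empirical covariance
    return (1.0 - eps) * S + eps * np.trace(S) / h * np.eye(h)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 10))  # fewer samples than dimensions
print(np.linalg.matrix_rank(shrunk_covariance(H, 0.0)))  # rank deficient
print(np.linalg.matrix_rank(shrunk_covariance(H, 0.2)))  # full rank after shrinkage
```

With fewer samples than dimensions the empirical covariance is singular, while any positive `eps` makes the estimate invertible and keeps its trace unchanged.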


Thus, our optimization problem can be stated as follows: 
Figure 1: Extreme Entropy Machine (on the left) and Extreme Entropy Kernel
Machine (on the right) as neural networks. In both cases all weights are either
randomly selected (dashed) or are the result of closed-form optimization (solid).


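Problem 2 (Optimization problem). Suppose that we are given two datasets $H^\pm$
in a Hilbert space $\mathcal{H}$ which come from the Gaussian distributions $\mathcal{N}(m^\pm, \Sigma^\pm)$.
Find $\beta \in \mathcal{H}$ such that the datasets

$$\beta^T H^+ \ \text{and} \ \beta^T H^-$$

are optimally discriminated in terms of Cauchy-Schwarz Divergence.

Because $H^\pm$ has density $\mathcal{N}(m^\pm, \Sigma^\pm)$, $\beta^T H^\pm$ has the density $\mathcal{N}(\beta^T m^\pm, \beta^T \Sigma^\pm \beta)$.

We put

$$m_\pm = \beta^T m^\pm, \quad S_\pm = \beta^T \Sigma^\pm \beta. \quad (2)$$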

Since, as one can easily compute,

$$\int \frac{\mathcal{N}(m_+, S_+)}{\|\mathcal{N}(m_+, S_+)\|_2} \frac{\mathcal{N}(m_-, S_-)}{\|\mathcal{N}(m_-, S_-)\|_2} = \frac{\mathcal{N}(m_+ - m_-, S_+ + S_-)[0]}{\left(\mathcal{N}(0, 2S_+)[0]\,\mathcal{N}(0, 2S_-)[0]\right)^{1/2}} = \frac{(2\sqrt{S_+ S_-})^{1/2}}{(S_+ + S_-)^{1/2}} \exp\left(-\frac{(m_+ - m_-)^2}{2(S_+ + S_-)}\right),$$

we obtain that

$$D_{CS}(\mathcal{N}(m_+, S_+), \mathcal{N}(m_-, S_-)) = -2\ln\int \frac{\mathcal{N}(m_+, S_+)}{\|\mathcal{N}(m_+, S_+)\|_2} \frac{\mathcal{N}(m_-, S_-)}{\|\mathcal{N}(m_-, S_-)\|_2} = \ln\frac{\frac{1}{2}(S_+ + S_-)}{\sqrt{S_+ S_-}} + \frac{(m_+ - m_-)^2}{S_+ + S_-}. \quad (3)$$

Observe that in the above equation the first term is the
logarithm of the quotient of arithmetic and geometric means of $S_+$ and $S_-$ (and therefore
in the typical cases is bounded and close to zero). Consequently, crucial information
is given by the last term. In order to confirm this claim we perform
experiments on over 20 datasets used in further evaluation (more details are located
in the Evaluation section). We compute the Spearman's rank correlation
coefficient between $D_{CS}(\mathcal{N}(m_+, S_+), \mathcal{N}(m_-, S_-))$ and $\frac{(m_+ - m_-)^2}{S_+ + S_-}$ for a hundred
random projections to $\mathcal{H}$ and a hundred random linear operators $\beta$.
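For two one-dimensional Gaussians the Cauchy-Schwarz divergence, computed from its definition $\ln\int f^2 + \ln\int g^2 - 2\ln\int fg$, reduces to the logarithm of the quotient of the arithmetic and geometric means of the variances plus the term $(m_+ - m_-)^2/(S_+ + S_-)$. The following sketch checks this numerically; the integration grid and the function names are assumptions of the example, and `s1`, `s2` denote variances.

```python
import numpy as np

def normal_pdf(x, m, s):
    # density of N(m, s) where s is the variance
    return np.exp(-(x - m) ** 2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)

def dcs_numeric(m1, s1, m2, s2, lo=-60.0, hi=60.0, n=400001):
    """D_CS from the definition, using a plain Riemann sum on a wide grid."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    f, g = normal_pdf(x, m1, s1), normal_pdf(x, m2, s2)
    ff, gg, fg = np.sum(f * f) * dx, np.sum(g * g) * dx, np.sum(f * g) * dx
    return np.log(ff) + np.log(gg) - 2.0 * np.log(fg)

def dcs_closed_form(m1, s1, m2, s2):
    arithmetic_mean = 0.5 * (s1 + s2)
    geometric_mean = np.sqrt(s1 * s2)
    return np.log(arithmetic_mean / geometric_mean) + (m1 - m2) ** 2 / (s1 + s2)

print(dcs_numeric(0.0, 1.0, 2.0, 3.0), dcs_closed_form(0.0, 1.0, 2.0, 3.0))
```

Both values agree up to the numerical integration error, and for equal variances only the mean-difference term remains.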


Table 1: Spearman's rank correlation coefficient between the optimized term and
the whole $D_{CS}$ for all datasets used in evaluation. Each column represents a different
dimension of the Hilbert space.

dataset                  1       10      100      200      500
AUSTRALIAN           0.928   -0.022    0.295    0.161    0.235
BREAST-CANCER        0.628    0.809    0.812    0.858    0.788
DIABETES            -0.983   -0.976   -0.941   -0.982   -0.952
GERMAN.NUMER         0.916    0.979    0.877    0.873    0.839
HEART                0.964    0.829    0.931    0.91     0.893
IONOSPHERE           0.999    0.988    0.98     0.978    0.984
LIVER-DISORDERS      0.232    0.308    0.363    0.33     0.312
SONAR               -0.31    -0.542   -0.41    -0.407   -0.381
SPLICE              -0.284   -0.036   -0.165   -0.118   -0.101
ABALONE7             1.0      0.999    0.999    0.999    0.998
ARYTHMIA             1.0      1.0      0.999    1.0      1.0
BALANCE              1.0      0.998    0.998    0.999    0.998
CAR EVALUATION       1.0      0.998    0.998    0.997    0.997
ECOLI                0.964    0.994    0.995    0.998    0.995
LIBRAS MOVE          1.0      0.999    0.999    1.0      1.0
OIL SPILL            1.0      1.0      1.0      1.0      1.0
SICK EUTHYROID       1.0      0.999    1.0      1.0      1.0
SOLAR FLARE          1.0      1.0      1.0      1.0      1.0
SPECTROMETER         1.0      1.0      0.999    0.999    0.999
FOREST COVER         0.988    0.997    0.997    0.992    0.988
ISOLET               0.784    1.0      0.997    0.997    0.999
MAMMOGRAPHY          1.0      1.0      1.0      1.0      1.0
PROTEIN HOMOLOGY     1.0      1.0      1.0      1.0      1.0
WEBPAGES             1.0      1.0      1.0      0.999    0.999


As one can see in Table 1, in small datasets (first part of the table) the
correlation is generally high, with some exceptions (like SONAR, SPLICE,
LIVER-DISORDERS and DIABETES). However, for bigger datasets (consisting of
thousands of examples) this correlation is nearly perfect (up to the randomization
process it is nearly 1.0 for all cases), which is a very strong empirical confirmation
of our claim that maximization of $\frac{(m_+ - m_-)^2}{S_+ + S_-}$ is generally equivalent to the
maximization of $D_{CS}(\mathcal{N}(m_+, S_+), \mathcal{N}(m_-, S_-))$.

This means that, after the above reductions, and application of (2), our final
problem can be stated as follows:
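Optimization problem: Extreme Entropy Machine

$$\underset{\beta}{\text{minimize}} \quad \beta^T \Sigma^+ \beta + \beta^T \Sigma^- \beta$$
$$\text{subject to} \quad \beta^T (m^+ - m^-) = 2$$
$$\text{where} \quad \Sigma^\pm = \mathrm{cov}_\dagger(H^\pm), \quad m^\pm = \frac{1}{|H^\pm|} \sum_{h^\pm \in H^\pm} h^\pm$$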

Before we continue to the closed-form solution we outline two methods of
actually transforming our data $X^\pm \subset \mathcal{X}$ to the highly dimensional $H^\pm \subset \mathcal{H}$, given
by the map $\varphi : \mathcal{X} \to \mathcal{H}$.

We investigate two approaches which lead to the Extreme Entropy Machine
and Extreme Entropy Kernel Machine respectively.

• for Extreme Entropy Machine (EEM) we use the random projection
technique, exactly the same as the one used in the ELM. In other words,
given some generalized activation function $G(x, w, b) : \mathcal{X} \times \mathcal{X} \times \mathbb{R} \to \mathbb{R}$ and
a constant h denoting the number of hidden neurons:

$$\varphi : \mathcal{X} \ni x \to [G(x, w_1, b_1), \ldots, G(x, w_h, b_h)]^T \in \mathbb{R}^h$$

where $w_i$ are random vectors and $b_i$ are random biases.

• for Extreme Entropy Kernel Machine (EEKM) we use the randomized
kernel approximation technique, which spans our Hilbert space on a
randomly selected subset of training vectors. In other words, given a valid
kernel $K(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ and the size of the kernel space base h:

$$\varphi_K : \mathcal{X} \ni x \to \left(K(x, X^{[h]})\, K(X^{[h]}, X^{[h]})^{-1/2}\right)^T \in \mathbb{R}^h$$

where $X^{[h]}$ is an h element random subset of X. It is easy to verify that
such a low rank approximation truly behaves as a kernel, in the sense that
for $\varphi_K(x_i), \varphi_K(x_j) \in \mathbb{R}^h$

$$\varphi_K(x_i)^T \varphi_K(x_j) = \left(\left(K(x_i, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1/2}\right)^T\right)^T \left(K(x_j, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1/2}\right)^T$$
$$= K(x_i, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1/2} \left(K(x_j, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1/2}\right)^T$$
$$= K(x_i, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1/2} K(X^{[h]}, X^{[h]})^{-1/2} K(X^{[h]}, x_j)$$
$$= K(x_i, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1} K(X^{[h]}, x_j),$$

given the true kernel projection $\phi_K$ such that

$$K(x_i, x_j) = \phi_K(x_i)^T \phi_K(x_j)$$

we have

$$K(x_i, X^{[h]}) K(X^{[h]}, X^{[h]})^{-1} K(X^{[h]}, x_j) = \phi_K(x_i)^T \phi_K(X^{[h]}) \left(\phi_K(X^{[h]})^T \phi_K(X^{[h]})\right)^{-1} \phi_K(X^{[h]})^T \phi_K(x_j)$$
$$= \phi_K(x_i)^T \phi_K(X^{[h]}) \phi_K(X^{[h]})^{-1} \left(\phi_K(X^{[h]})^T\right)^{-1} \phi_K(X^{[h]})^T \phi_K(x_j)$$
$$= \phi_K(x_i)^T \phi_K(x_j)$$
$$= K(x_i, x_j).$$

Thus for the whole samples' set we have

$$\varphi_K(X)^T \varphi_K(X) = K(X, X),$$

which is a complete Gram matrix.
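The randomized kernel projection can be sketched as follows. This is an illustrative example under assumptions: an RBF kernel with a hypothetical `gamma` parameter, an eigendecomposition-based inverse square root with a small numerical floor `jitter`, and the names `rbf`, `inv_sqrt_psd` and `build_phi_k`. Samples are stored in rows, so the returned function yields $\varphi_K(x)^T$ for every row x.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def inv_sqrt_psd(K, jitter=1e-10):
    """K^{-1/2} of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(K)
    vals = np.maximum(vals, jitter)  # guard against tiny negative round-off
    return (vecs / np.sqrt(vals)) @ vecs.T

def build_phi_k(X, h, rng, gamma=0.5):
    """Return a function mapping rows x to K(x, X^[h]) K(X^[h], X^[h])^{-1/2}."""
    idx = rng.choice(X.shape[0], size=h, replace=False)  # random subset X^[h]
    Xh = X[idx]
    K_inv_sqrt = inv_sqrt_psd(rbf(Xh, Xh, gamma))
    return lambda Z: rbf(Z, Xh, gamma) @ K_inv_sqrt

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
phi = build_phi_k(X, h=15, rng=rng)
print(np.allclose(phi(X) @ phi(X).T, rbf(X, X), atol=1e-6))
```

When h equals the number of samples the inner products of the projected points reproduce the full Gram matrix; for smaller h they give its low rank approximation.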

So the only difference between Extreme Entropy Machine and Extreme Entropy
Kernel Machine is that in the latter we use $H^\pm = \varphi_K(X^\pm)$, where K is a selected
kernel, instead of $H^\pm = \varphi(X^\pm)$. Fig. 1 visualizes these two approaches as neural
networks; in particular EEM is a simple SLFN, while EEKM leads to a
network with two hidden layers.



Figure 2: Visualization of the whole EEM classification process. From the left:
linearly non separable data in $\mathcal{X}$; data mapped to the $\mathcal{H}$ space, where we find
covariance estimators; density of projected Gaussians on which the decision is
based; decision boundary in the input space $\mathcal{X}$.


Remark 2. Extreme Entropy Machine optimization problem is closely related 
to the SVM optimization, but instead of maximizing the margin between closest 
points we are maximizing the mean margin. 

Proof. Let us recall that in SVM we try to maximize the margin under
constraints that negative samples are projected at values at most -1 ($\beta^T h^- + b \leq -1$)
and positive samples on at least 1 ($\beta^T h^+ + b \geq 1$). In other words, we are
minimizing the $\beta$ operator norm

$$\|\beta\|_2$$

which is equivalent to minimizing the square of this norm $\|\beta\|_2^2$, under the constraint
that

$$\min_{h^+ \in H^+} \{\beta^T h^+\} - \max_{h^- \in H^-} \{\beta^T h^-\} = 1 - (-1) = 2.$$

On the other hand, EEM tries to minimize

$$\beta^T \Sigma^+ \beta + \beta^T \Sigma^- \beta = \beta^T (\Sigma^+ + \Sigma^-) \beta = \|\beta\|^2_{\Sigma^+ + \Sigma^-}$$

under the constraint that

$$\frac{1}{|H^+|} \sum_{h^+ \in H^+} \beta^T h^+ - \frac{1}{|H^-|} \sum_{h^- \in H^-} \beta^T h^- = 2.$$

So what is happening here is that we are trying to maximize the mean margin
between classes in the Mahalanobis norm generated by the sum of the classes'
covariances. It was previously shown in the Two ellipsoid Support Vector Machines
model that such a norm is an approximation of the margin coming from two
ellipsoids instead of the single ball used by traditional SVM. □


A similar observation regarding the connection between large margin classification
and entropy optimization has been made in the case of the Multithreshold Linear
Entropy Classifier.

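We are going to show, by applying the standard method of Lagrange multipliers,
that the above problem has a closed form solution (similar to Fisher's
Discriminant). Let

$$\Sigma = \Sigma^+ + \Sigma^- \ \text{and} \ m = m^+ - m^-.$$

We put

$$L(\beta, \lambda) := \beta^T \Sigma \beta - \lambda(\beta^T m - 2).$$

Then

$$\nabla_\beta L = 2\Sigma\beta - \lambda m \ \text{and} \ \frac{\partial}{\partial \lambda} L = -(\beta^T m - 2),$$

which means that we need to solve, with respect to $\beta$, the system

$$2\Sigma\beta - \lambda m = 0,$$
$$\beta^T m = 2.$$

Therefore $\beta = \frac{\lambda}{2}\Sigma^{-1} m$, which yields

$$\frac{\lambda}{2} m^T \Sigma^{-1} m = 2,$$

and consequently³, if $m \neq 0$, then $\lambda = 4/\|m\|^2_\Sigma$ and

$$\beta = \frac{2}{\|m\|^2_\Sigma} \Sigma^{-1} m = \frac{2(\Sigma^+ + \Sigma^-)^{-1}(m^+ - m^-)}{\|m^+ - m^-\|^2_{\Sigma^+ + \Sigma^-}}. \quad (4)$$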
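The closed form solution can be verified numerically: the resulting $\beta$ satisfies the constraint $\beta^T(m^+ - m^-) = 2$ and no other feasible vector gives a smaller value of the objective. The sketch below is a minimal illustration; the function name `eem_beta` and the randomly generated means and covariances are assumptions of the example.

```python
import numpy as np

def eem_beta(m_pos, m_neg, S_pos, S_neg):
    """beta = 2 * Sigma^{-1} m / (m^T Sigma^{-1} m) for Sigma = S_pos + S_neg, m = m_pos - m_neg."""
    m = m_pos - m_neg
    if not np.any(m):
        return np.zeros_like(m)  # degenerate case of equal means: trivial classifier
    sigma_inv_m = np.linalg.solve(S_pos + S_neg, m)
    return 2.0 * sigma_inv_m / (m @ sigma_inv_m)

rng = np.random.default_rng(0)
h = 6
A, B = rng.normal(size=(h, h)), rng.normal(size=(h, h))
S_pos, S_neg = A @ A.T + np.eye(h), B @ B.T + np.eye(h)  # random positive definite matrices
m_pos, m_neg = rng.normal(size=h), rng.normal(size=h)

beta = eem_beta(m_pos, m_neg, S_pos, S_neg)
print(beta @ (m_pos - m_neg))  # equals 2 up to round-off
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of the sum of covariances.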


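The final decision of the class of the point h is therefore given by the comparison
of the values

$$\mathcal{N}(\beta^T m^+, \beta^T \Sigma^+ \beta)[\beta^T h] \ \text{and} \ \mathcal{N}(\beta^T m^-, \beta^T \Sigma^- \beta)[\beta^T h].$$

We distinguish two cases based on the number of the resulting classifier's thresholds
(points t such that $\mathcal{N}(\beta^T m^+, \beta^T \Sigma^+ \beta)[t] = \mathcal{N}(\beta^T m^-, \beta^T \Sigma^- \beta)[t]$):

³ where $\|m\|^2_\Sigma = m^T \Sigma^{-1} m$ denotes the squared Mahalanobis norm of m.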
1. $S_- = S_+$, then there is one threshold

$$t_0 = m_- + 1,$$

which results in a traditional (one-threshold) linear classifier,

2. $S_- \neq S_+$, then there are two thresholds

$$t_\pm = m_- + \frac{2S_- \pm \sqrt{S_- S_+ \left(\ln(S_-/S_+)(S_- - S_+) + 4\right)}}{S_- - S_+},$$

which makes the resulting classifier a member of the two-thresholds linear
classifiers family.
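The thresholds are the points where the two projected one-dimensional Gaussian densities, whose means differ by 2 because of the constraint, take equal values. A small sketch (the function names are assumptions of this example) computes them and checks the density equality:

```python
import math

def normal_pdf(x, m, s):
    # density of N(m, s) where s is the variance
    return math.exp(-(x - m) ** 2 / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)

def eem_thresholds(m_neg, S_neg, S_pos):
    """Points where N(m_neg, S_neg) and N(m_neg + 2, S_pos) have equal density."""
    if math.isclose(S_neg, S_pos):
        return [m_neg + 1.0]  # one threshold, halfway between the means
    root = math.sqrt(S_neg * S_pos * (math.log(S_neg / S_pos) * (S_neg - S_pos) + 4.0))
    return sorted(m_neg + (2.0 * S_neg + sign * root) / (S_neg - S_pos)
                  for sign in (1.0, -1.0))

for t in eem_thresholds(0.0, 1.0, 4.0):
    print(t, normal_pdf(t, 0.0, 1.0), normal_pdf(t, 2.0, 4.0))
```

The expression under the square root is always positive, since $\ln(S_-/S_+)$ and $S_- - S_+$ have the same sign.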

Obviously, in the degenerate case, when $m = 0 \iff m^- = m^+$, there is no
solution, as the constraint $\beta^T(m^+ - m^-) = 2$ is not fulfilled for any $\beta$. In such
a case EEM returns a trivial classifier constantly equal to any class (we put
$\beta = 0$).


From the neural network perspective we simply construct a custom activation
function $F(\cdot)$ in the output neuron depending on one of the two described cases:




if t- < t+ and 



otherwise. 

The whole classification process is visualized in Fig. 2: we begin with data in
the input space $\mathcal{X}$, transform it into the Hilbert space $\mathcal{H}$ where we model them as
Gaussians, then perform optimization leading to the projection on $\mathbb{R}$ through $\beta$
and perform density based classification leading to a non-linear decision boundary
in $\mathcal{X}$.

4 Theory: density estimation in the kernel case 

To illustrate our reasoning, we consider a typical basic problem concerning the 
density estimation. 

Problem 3. Assume that we are given a finite data set H in a Hilbert space $\mathcal{H}$
generated by the unknown density f, and we want to obtain an estimate of f.

Since the problem in itself is infinite dimensional typically the data would 
be linearly independent. Moreover, one usually can not obtain reliable density 
estimation - the most we can hope is that after transformation by a linear 
functional into R, the resulting density will be well-estimated. 

To simplify the problem assume therefore that we want to find the desired
density in the class of normal densities - or equivalently that we are interested
only in the estimation of the mean and covariance of f.

The generalization of the above problem is given by the following problem: 
Problem 4. Assume that we are given finite data sets $H^\pm$ in a Hilbert space
$\mathcal{H}$, generated by the unknown densities $f^\pm$, and we want to obtain an estimate of
the unknown densities.

In general $\dim(\mathcal{H}) = h \gg N$, which means that we have very sparse data in
terms of the Hilbert space. As a result, classical kernel density estimation (KDE)
is not a reliable source of information [16]. In the absence of different tools we
can however use KDE with a very big kernel width in order to cover at least some
general shape of the whole density.


Remark 3. Assume that we are given finite data sets $H^\pm$ with means $m^\pm$ and
covariances $\Sigma^\pm$ in a Hilbert space $\mathcal{H}$. If we conduct kernel density estimation
using a Gaussian kernel then, in a limiting case, each class becomes a Normal
distribution:

$$\lim_{\sigma \to \infty} \left\| [H^\pm]_\sigma - \mathcal{N}(m^\pm, \sigma^2 \Sigma^\pm) \right\|_2 = 0,$$

where

$$[A]_\sigma = \frac{1}{|A|} \sum_{a \in A} \mathcal{N}(a, \sigma^2 \cdot \mathrm{cov}(A)).$$

Proof of this remark is given by Czarnecki and Tabor and means that if we
perform a Gaussian kernel density estimation of our data with a big kernel width
(which is reasonable for a small amount of data in a highly dimensional space) then
for big enough $\sigma$ EEM is a nearly optimal linear classifier in terms of the estimated
densities.

Let us now investigate the probabilistic interpretation of EEM. Under the
assumption that $H^\pm \sim \mathcal{N}(m^\pm, \Sigma^\pm)$ we have the conditional probabilities

$$p(h \mid \pm) = \mathcal{N}(m^\pm, \Sigma^\pm)[h],$$

so from Bayes rule we conclude that

$$p(\pm \mid h) = \frac{p(h \mid \pm)\, p(\pm)}{p(h)} \propto \mathcal{N}(m^\pm, \Sigma^\pm)[h]\, p(\pm),$$

where $p(\pm)$ is a prior classes' distribution. In our case, due to the balanced nature
(meaning that despite classes imbalance we maximize a balanced quality
measure such as Averaged Accuracy) we have $p(\pm) = 1/2$.

But

$$p(h) = \sum_{t \in \{+,-\}} p(h \mid t)\, p(t),$$

so

$$p(\pm \mid h) = \frac{\mathcal{N}(m^\pm, \Sigma^\pm)[h]}{\sum_{t \in \{+,-\}} \mathcal{N}(m^t, \Sigma^t)[h]}.$$

Furthermore it is easy to show that under the normality assumption, the
resulting classifier is optimal in the Bayesian sense.

Remark 4. If data in feature space comes from Normal distributions $\mathcal{N}(m^\pm, \Sigma^\pm)$
then $\beta$ given by EEM minimizes the probability of misclassification. More strictly
speaking, if we draw $h^+$ with probability 1/2 from $\mathcal{N}(m^+, \Sigma^+)$ and $h^-$ with 1/2
from $\mathcal{N}(m^-, \Sigma^-)$ then for any $\alpha \in \mathbb{R}^h$

$$p(\mp \mid \beta^T h^\pm) \leq p(\mp \mid \alpha^T h^\pm)$$
5 Theory: learning capabilities

First we show that under some simplifying assumptions, the proposed method behaves
as Extreme Learning Machine (or Weighted Extreme Learning Machine [25]).

Before proceeding further we would like to remark that there are two popular
notations for projecting data onto hyperplanes. One, used in the ELM model,
assumes that H is a row matrix and $\beta$ is a column vector, which results in the
projection's equation $H\beta$. The second one, used in SVM and in our paper, assumes
that both H and $\beta$ are column oriented, which results in the $\beta^T H$ projection.

In the following theorem we will show some duality between $\beta$ found by ELM
and by EEM. In order to do so, we will need to change the notation during the
proof, which will be indicated.


Theorem 1. Let us assume that we are given an arbitrary, balanced⁴ dataset
$\{(x_i, t_i)\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, $t_i \in \{-1, +1\}$, $|X^-| = |X^+|$ which can be perfectly learned
by ELM with N hidden neurons. If this dataset's points' image through random
neurons $H = \varphi(X)$ is centered (points' images have 0 mean) and classes
have homogeneous covariances (we can assume that $\exists a_\pm \in \mathbb{R}_+ : \mathrm{cov}(H) = a_+ \mathrm{cov}(H^+) =
a_- \mathrm{cov}(H^-)$) then EEM with the same hidden layer will also learn this dataset
perfectly (with 0 error).


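Proof. In the first part of the proof we use the ELM notation. Projected
data is centered, so $\mathrm{cov}(H) = H^T H$. ELM is able to learn this dataset perfectly,
consequently H is invertible, thus also $H^T H$ is invertible, as a result
$\mathrm{cov}_\dagger(H) = \mathrm{cov}(H) = H^T H$. We will now show that $\exists a \in \mathbb{R}_+ : \beta_{ELM} = a \cdot \beta_{EEM}$.
First, let us recall that $\beta_{ELM} = H^\dagger T = H^{-1} T$ and $\beta_{EEM} = \frac{2(\Sigma^+ + \Sigma^-)^{-1}(m^+ - m^-)}{\|m^+ - m^-\|^2_{\Sigma^+ + \Sigma^-}}$,
where $\Sigma^\pm = \mathrm{cov}_\dagger(H^\pm)$. Due to the assumption of geometric homogeneity
$\beta_{EEM} = \frac{2}{\|m^+ - m^-\|^2_{\Sigma^+ + \Sigma^-}} \left(\frac{a_+ + a_-}{a_+ a_-} \Sigma\right)^{-1} (m^+ - m^-)$, where $\Sigma = \mathrm{cov}_\dagger(H)$. Therefore

$$\beta_{ELM} = H^{-1} T = (H^T H)^{-1} H^T T = \mathrm{cov}_\dagger^{-1}(H)\, H^T T.$$

From now on we change the notation back to the one used in this paper.

$$\beta_{ELM} = \Sigma^{-1}\left(\sum_{h^+ \in H^+} (+1 \cdot h^+) + \sum_{h^- \in H^-} (-1 \cdot h^-)\right)$$
$$= \Sigma^{-1}\left(\sum_{h^+ \in H^+} h^+ - \sum_{h^- \in H^-} h^-\right)$$
$$= \Sigma^{-1} \frac{N}{2} (m^+ - m^-)$$
$$= \frac{N}{2} \cdot \frac{\|m^+ - m^-\|^2_{\Sigma^+ + \Sigma^-}}{2} \cdot \frac{a_+ + a_-}{a_+ a_-} \cdot \beta_{EEM}$$
$$= a \cdot \beta_{EEM},$$

for $a = \frac{N}{4} \frac{a_+ + a_-}{a_+ a_-} \|m^+ - m^-\|^2_{\Sigma^+ + \Sigma^-} \in \mathbb{R}_+$. Again from homogeneity we obtain just one
equilibrium point, located in $\beta_{EEM}^T (m^+ + m^-)/2$, which results in the exact
same classifier as the one given by ELM. This completes the proof.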

⁴ An analogous result can be shown for an unbalanced dataset and Weighted ELM with a particular
weighting scheme.
A similar result holds for EEKM and Least Squares Support Vector Machine.

Theorem 2. Let us assume that we are given an arbitrary, balanced⁵ dataset
$\{(x_i, t_i)\}_{i=1}^N$, $x_i \in \mathbb{R}^d$, $t_i \in \{-1, +1\}$, $|X^-| = |X^+|$ which can be perfectly learned
by LS-SVM. If the dataset's points' images through the kernel induced projection $\varphi_K$
have homogeneous classes' covariances (we can assume that $\exists a_\pm \in \mathbb{R}_+ : \mathrm{cov}(\varphi_K(X)) =
a_+ \mathrm{cov}(\varphi_K(X^+)) = a_- \mathrm{cov}(\varphi_K(X^-))$) then EEKM with the same kernel and N
hidden neurons will also learn this dataset perfectly (with 0 error).

Proof. It is a direct consequence of the fact that with N hidden neurons and
homogeneous covariances of the classes' projections, EEKM degenerates to the kernelized
Fisher Discriminant which, as Gestel et al. showed [24], is equivalent to the
solution of the Least Squares SVM. □


6 Practical considerations 


We can formulate the whole EEM training as a very simple algorithm (see
Alg. 1).


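Algorithm 1 Extreme Entropy (Kernel) Machine

  train($X^+$, $X^-$)
    build $\varphi$ using Algorithm 2
    $H^\pm \leftarrow \varphi(X^\pm)$
    $\Sigma^\pm \leftarrow \mathrm{cov}_\dagger(H^\pm)$
    $\beta \leftarrow 2(\Sigma^+ + \Sigma^-)^{-1}(m^+ - m^-)/\|m^+ - m^-\|_{\Sigma^+ + \Sigma^-}$
    $F(x) = \arg\max_{t \in \{+,-\}} \mathcal{N}(\beta^T m^t, \beta^T \Sigma^t \beta)[x]$
    return $\beta$, $\varphi$, $F$

  predict($X$)
    return $F(\beta^T \varphi(X))$


Algorithm 2 $\varphi$ building

  Extreme Entropy Machine($G$, $h$)
    select randomly $w_i, b_i$ for $i \in \{1, \ldots, h\}$
    $\varphi(x) = [G(x, w_1, b_1), \ldots, G(x, w_h, b_h)]^T$
    return $\varphi$

  Extreme Entropy Kernel Machine($K$, $h$, $X$)
    select randomly $X^{[h]} \subset X$, $|X^{[h]}| = h$
    $K^{[h]} \leftarrow K(X^{[h]}, X^{[h]})^{-1/2}$
    $\varphi_K(x) = K^{[h]} K(X^{[h]}, x)$
    return $\varphi_K$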
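The training and prediction steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the RBF activation $G(x,w,b) = \exp(-b\|w - x\|^2)$, replaces the $\mathrm{cov}_\dagger$ estimator with the empirical covariance plus a small ridge (the `ridge` parameter is hypothetical), computes $m^\pm$ as the class means of $H^\pm$, and reads $\|\cdot\|_{\Sigma}$ as the Mahalanobis norm $\sqrt{v^T \Sigma^{-1} v}$. All function names are illustrative.

```python
import numpy as np

def build_phi(d, h, rng):
    """Random RBF projection with w_i and b_i drawn uniformly from [0, 1]."""
    W = rng.uniform(0.0, 1.0, size=(h, d))
    b = rng.uniform(0.0, 1.0, size=h)
    def phi(X):
        sq_dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # shape (n, h)
        return np.exp(-b * sq_dist)
    return phi

def train_eem(X_pos, X_neg, phi, ridge=1e-6):
    H = {+1: phi(X_pos), -1: phi(X_neg)}
    m = {t: H[t].mean(axis=0) for t in H}
    h = m[+1].shape[0]
    # Assumption: empirical covariance + ridge stands in for cov_dagger
    S = {t: np.cov(H[t], rowvar=False) + ridge * np.eye(h) for t in H}
    diff = m[+1] - m[-1]
    v = np.linalg.solve(S[+1] + S[-1], diff)
    beta = 2.0 * v / np.sqrt(diff @ v)       # divide by the Mahalanobis norm
    mu = {t: beta @ m[t] for t in H}         # projected class means
    var = {t: beta @ S[t] @ beta for t in H}  # projected class variances
    return beta, mu, var

def predict_eem(X, phi, beta, mu, var):
    z = phi(X) @ beta
    def log_pdf(t):
        return -0.5 * np.log(2 * np.pi * var[t]) - (z - mu[t]) ** 2 / (2 * var[t])
    return np.where(log_pdf(+1) >= log_pdf(-1), 1, -1)

# Toy usage on two synthetic blobs inside [0, 1]^2
rng = np.random.default_rng(0)
X_pos = rng.normal(0.75, 0.05, size=(200, 2))
X_neg = rng.normal(0.25, 0.05, size=(200, 2))
phi = build_phi(d=2, h=20, rng=rng)
beta, mu, var = train_eem(X_pos, X_neg, phi)
pred = predict_eem(np.vstack([X_pos, X_neg]), phi, beta, mu, var)
truth = np.r_[np.ones(200), -np.ones(200)]
accuracy = (pred == truth).mean()
```

With this normalization the two projected variances always sum to 4, which is why the classification rule depends only on the two one-dimensional Gaussians.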


(Footnote to Theorem 2) An analogous result can be shown for an unbalanced dataset
and Balanced LS-SVM with a particular weighting scheme.
The resulting model consists of three elements:

• feature projection function $\varphi$,

• linear operator $\beta$,

• classification rule $F$.

As described before, $F$ can be further compressed to just one or two thresholds
using equations from previous sections. Either way, the complexity of the
resulting model is linear in terms of hidden units and classification of a new
point takes $O(dh)$ time.

During EEM training, the most expensive part of the algorithm is the
computation of the covariance estimators and inversion of the sum of covariances.
Even computation of the empirical covariance takes $O(Nh^2)$ time, so the total
complexity of training, equal to $O(h^3 + Nh^2) = O(Nh^2)$ (for $h \le N$), is
acceptable. It is worth noting that training of the ELM also takes exactly
$O(Nh^2)$ time as it requires computation of $H^T H$ for $H \in \mathbb{R}^{N \times h}$.
Training of EEKM requires additional computation of the square root of the
sampled kernel matrix inverse $K(X^{[h]}, X^{[h]})^{-1/2}$, but as
$K(X^{[h]}, X^{[h]}) \in \mathbb{R}^{h \times h}$ can be computed in $O(dh^2)$
and both inverting and square rooting can be done in $O(h^3)$, we obtain the
exact same asymptotic computational complexity as that of EEM. Square rooting
and inverting are both possible whenever $K(X^{[h]}, X^{[h]})$ is strictly
positive definite and thus invertible; a valid kernel in the Mercer's sense
guarantees positive semi-definiteness in general, and strict positive
definiteness holds for example for the Gaussian kernel on distinct points.
Further comparison of EEM, ELM and SVM is summarized in Table 2.
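One possible way to obtain the inverse square root of the sampled kernel matrix in $O(h^3)$ is through a symmetric eigendecomposition. The sketch below is an illustration, not the authors' code: the helper names, the sample points and the eigenvalue clipping threshold `eps` are assumptions introduced here.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def inv_sqrt_spd(K, eps=1e-12):
    """K^{-1/2} of a symmetric positive definite matrix via eigh."""
    eigval, eigvec = np.linalg.eigh(K)
    eigval = np.clip(eigval, eps, None)   # guard against round-off (assumption)
    return (eigvec / np.sqrt(eigval)) @ eigvec.T

# h = 5 distinct sampled points in d = 2 dimensions
X_h = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
K = gaussian_kernel(X_h, X_h, gamma=2.0)
K_h = inv_sqrt_spd(K)

# K^{-1/2} K K^{-1/2} should be (numerically) the identity
residual = np.abs(K_h @ K @ K_h - np.eye(len(X_h))).max()
```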

The next aspect we would like to discuss is cost sensitive learning. EEMs
are balanced models in the sense that they are trying to maximize the balanced
quality measures (like Averaged Accuracy or GMean). However, in practical
applications it might be the case that we are actually more interested in the
positive class than the negative one (like in medical applications). The proposed
model gives direct probability estimates of $p(\beta^T h \mid t)$, which we can
easily convert to a cost sensitive classifier by introducing the prior
probabilities of each class. Directly from Bayes' Theorem, given $p(+)$ and
$p(-)$, we can label our new sample $h$ according to

$$p(t \mid \beta^T h) \propto p(t)\, p(\beta^T h \mid t),$$

so if we are given costs $C_+, C_- \in \mathbb{R}_+$ we can use them as weighting of priors:

$$cl(x) = \arg\max_{t \in \{-,+\}} C_t\, p(t)\, p(\beta^T h \mid t).$$
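A small sketch of such a cost weighted decision on the one-dimensional projection $z = \beta^T h$ is given below. The class conditional densities are taken to be the Gaussians fitted during training; the numeric values of the means, variances, priors and costs are made up for illustration, and the function names are hypothetical.

```python
import math

def log_gauss(z, mu, var):
    """Log-density of a 1-D Gaussian N(mu, var) at z."""
    return -0.5 * math.log(2 * math.pi * var) - (z - mu) ** 2 / (2 * var)

def cost_sensitive_label(z, mu, var, prior, cost):
    """argmax over t of C_t * p(t) * p(z | t), evaluated in log space."""
    scores = {
        t: math.log(cost[t]) + math.log(prior[t]) + log_gauss(z, mu[t], var[t])
        for t in (+1, -1)
    }
    return max(scores, key=scores.get)

mu = {+1: 1.0, -1: -1.0}      # projected class means (illustrative)
var = {+1: 1.0, -1: 1.0}      # projected class variances (illustrative)
prior = {+1: 0.5, -1: 0.5}
z = -0.2                      # projection of a new sample

label_equal_costs = cost_sensitive_label(z, mu, var, prior, {+1: 1.0, -1: 1.0})
label_costly_pos = cost_sensitive_label(z, mu, var, prior, {+1: 5.0, -1: 1.0})
```

In this toy setting the sample lies slightly closer to the negative mean, so equal costs give the negative label, while weighting the positive class five times higher moves the decision to the positive label.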

Let us now investigate the possible efficiency bottleneck. In EEKM, the
classification of a new point $x$ is based on

$$
\begin{aligned}
cl(x) &= F(\beta^T \varphi_K(x)) \\
&= F(\beta^T (K(x, X^{[h]}) K^{[h]})^T) \\
&= F(\beta^T (K^{[h]})^T K(x, X^{[h]})^T) \\
&= F((K^{[h]} \beta)^T K(X^{[h]}, x)).
\end{aligned}
$$

One can convert EEKM to the SLFN by putting:

$$\hat{\varphi}_K(x) = K(X^{[h]}, x),$$

$$\hat{\beta} = K^{[h]} \beta,$$

so the classification rule becomes

$$cl(x) = F(\hat{\beta}^T \hat{\varphi}_K(x)).$$

This way the complexity of the new point's classification is exactly the same as
in the case of EEM and ELM (or any other SLFN).
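The folding of $K^{[h]}$ into the linear operator can be verified numerically. In the sketch below the sampled points, the vector `beta` and the kernel width are random stand-ins chosen for illustration; only the identity $\beta^T K^{[h]} K(X^{[h]}, x) = (K^{[h]}\beta)^T K(X^{[h]}, x)$ is being demonstrated.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(0)
h, d, gamma = 4, 3, 5.0
X_h = rng.uniform(size=(h, d))           # sampled subset X^[h]
x = rng.uniform(size=(1, d))             # new point to classify
beta = rng.normal(size=h)                # stand-in for a trained beta

# K^[h] = K(X^[h], X^[h])^{-1/2} via symmetric eigendecomposition
eigval, eigvec = np.linalg.eigh(gaussian_kernel(X_h, X_h, gamma))
K_h = (eigvec / np.sqrt(eigval)) @ eigvec.T

k_x = gaussian_kernel(X_h, x, gamma)[:, 0]   # K(X^[h], x), shape (h,)

score_two_step = beta @ (K_h @ k_x)      # beta^T phi_K(x)
beta_hat = K_h @ beta                    # folded once, after training
score_folded = beta_hat @ k_x            # only h kernel evaluations per point
```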


7 Evaluation 


For the evaluation purposes we implemented five methods, namely: Weighted
Extreme Learning Machine (WELM [25]), Extreme Entropy Machine (EEM),
Extreme Entropy Kernel Machine (EEKM), Least Squares Support Vector
Machines (LS-SVM [21]) and Support Vector Machines (SVM).

All methods but SVM were implemented using Python with use of the
bleeding-edge versions of NumPy [23] and SciPy libraries included in
Anaconda$^6$ for fair comparison. For SVM we used the highly efficient libSVM
library with bindings available in scikit-learn [17]. Random projection based
methods (WELM, EEM) were tested using the three following generalized activation
functions $G(x, w, b)$:

• sigmoid (sig),

• normalized sigmoid (nsig),

• radial basis function (rbf): $\exp(-b\|w - x\|^2)$.

Random parameters (weights and biases) were selected from uniform
distributions on [0,1]. Training of WELM was performed using the Moore-Penrose
pseudoinverse and of EEM using the Ledoit-Wolf covariance estimator, as both are
parameterless, closed form estimators of the required objects. For kernel
methods (EEKM, LS-SVM, SVM) we used the Gaussian kernel (rbf)
$K_\gamma(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^2)$. In all methods requiring
class balancing schemes (WELM, LS-SVM, SVM) we used balance weights $w_i$ equal
to the ratio of the size of the bigger class to the size of the class of the
$i$-th sample (so $\sum_{i=1}^N w_i t_i = 0$).
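The balance weights described above can be computed in a few lines. The helper name and the toy label vector below are illustrative; the final check confirms that the weighted labels sum to zero.

```python
import numpy as np

def balance_weights(t):
    """Weight of each sample: size of the bigger class / size of its own class."""
    t = np.asarray(t)
    n_pos = np.sum(t == 1)
    n_neg = np.sum(t == -1)
    bigger = max(n_pos, n_neg)
    return np.where(t == 1, bigger / n_pos, bigger / n_neg)

t = np.array([1] * 3 + [-1] * 12)     # 4:1 imbalance
w = balance_weights(t)
weighted_label_sum = np.sum(w * t)    # 3 * 4.0 - 12 * 1.0
```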

Metaparameters of each model were fitted using grid search over:
hidden layer size $h = 50, 100, 250, 500, 1000$ (WELM, EEM, EEKM), Gaussian
kernel width $\gamma = 10^{-10}, \ldots, 10^{0}$ (EEKM, LS-SVM, SVM), and SVM
regularization parameter $C = 10^{-1}, \ldots, 10^{10}$ (LS-SVM, SVM).

Datasets' features were linearly scaled in order to have each feature in the
interval [0,1]. No other data whitening/filtering was performed. All experiments
were performed in repeated 10-fold stratified cross-validation.

We use GMean$^7$ (geometric mean of accuracy over positive and negative
samples) as an evaluation metric, due to its balanced nature and usage in
previous works regarding Weighted Extreme Learning Machines [25].

$^6$ https://store.continuum.io/cshop/anaconda/
$^7$ $\mathrm{GMean}(TP, FP, TN, FN) = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}$
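For reference, the metric can be computed directly from the confusion matrix counts. The function below is a straightforward sketch (its name and the example counts are illustrative); the second example shows why the measure is considered balanced: a classifier that ignores the positive class scores zero regardless of how many negatives it gets right.

```python
import math

def gmean(tp, fp, tn, fn):
    """Geometric mean of the accuracies on positive and negative samples."""
    acc_pos = tp / (tp + fn)
    acc_neg = tn / (tn + fp)
    return math.sqrt(acc_pos * acc_neg)

score = gmean(tp=80, fp=5, tn=45, fn=20)          # sqrt(0.8 * 0.9)
score_all_negative = gmean(tp=0, fp=0, tn=100, fn=10)
```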
Table 3: Characteristics of used datasets

dataset             d      |X^-|     |X^+|
-------------------------------------------
AUSTRALIAN          14     383       307
BANK                4      762       610
BREAST CANCER       9      444       239
DIABETES            8      268       500
GERMAN NUMER        24     700       300
HEART               13     150       120
LIVER-DISORDERS     6      145       200
SONAR               60     111       97
SPLICE              60     483       517
-------------------------------------------
ABALONE7            10     3786      391
ARYTHMIA            261    427       25
CAR EVALUATION      21     1594      134
ECOLI               7      301       35
LIBRAS MOVE         90     336       24
OIL SPILL           48     896       41
SICK EUTHYROID      42     2870      293
SOLAR FLARE         32     1321      68
SPECTROMETER        93     486       45
-------------------------------------------
FOREST COVER        54     571519    9493
ISOLET              617    7197      600
MAMMOGRAPHY         6      10923     260
PROTEIN HOMOLOGY    74     144455    1296
WEBPAGES            300    33799     981


7.1 Basic UCI datasets 

We start our experiments with nine datasets coming from the UCI repository,
namely AUSTRALIAN, BREAST-CANCER, DIABETES, GERMAN.NUMER, HEART,
IONOSPHERE, LIVER-DISORDERS, SONAR and SPLICE, summarized in Table 3.
These datasets include rather balanced, low dimensional problems.

On such data, EEM seems to perform noticeably better than ELM when using the
RBF activation function (see the corresponding results table), and rather
similarly when using the sigmoid one; in such a scenario, for some datasets ELM
achieves better results while for others EEM wins. Results obtained for EEKM
are comparable with those obtained by LS-SVM and SVM; in both cases the proposed
method achieves better results on about a third of the problems, draws on a
third and loses on a third. These experiments can be seen as a proof of concept
of the whole methodology, showing that it can truly be a reasonable alternative
to existing models in some problems. It appears that, contrary to ELM, the
proposed methods (EEM and EEKM) achieve the best scores across all considered
models on some of the datasets regardless of the used activation function/kernel
(only Support Vector Machines and their least squares counterpart are
competitive in this sense).
7.2 Highly unbalanced datasets

In the second part we proceeded to the nine highly unbalanced datasets,
summarized in the second part of Table 3. The ratio between the bigger and the
smaller class varies from 10:1 to even 20:1, which makes them really hard for
unbalanced models. Obtained results (see the corresponding results table)
resemble those obtained on the UCI repository. We can see better results in
about half of the experiments if we fix a particular activation function/kernel
(so we compare ELM$_x$ with EEM$_x$ and LS-SVM$_x$ with EEKM$_x$). The timing
results show that training times of Extreme Entropy Machines are comparable with
the ones obtained by Extreme Learning Machines (differences on the level of
0.1-0.2 are not significant on such datasets' sizes). We have a robust method
which learns a model for hundreds/thousands of examples in below two seconds.
For larger datasets (like ABALONE7 or SICK EUTHYROID) the proposed methods not
only outperform SVM and LS-SVM in terms of robustness, but there is also a
noticeable difference between their training times and those of ELMs. This
suggests that even though ELM and EEM are quite similar and on small datasets
are equally fast, EEM can better scale up to truly big datasets. Obviously the
obtained training times do not reflect the full training time, as it strongly
depends on the technique used for metaparameter selection and the resolution of
the grid search (or other parameter tuning technique). In such a full scenario,
training times of SVM related models are significantly longer due to the
requirement of exact tuning of both $C$ and $\gamma$ in real domains.

7.3 Extremely unbalanced datasets 

The third part of experiments consists of extremely unbalanced datasets (with
class imbalance up to 100:1) containing tens and hundreds of thousands of
examples. The five analyzed datasets span from NLP tasks (WEBPAGES) through
medical applications (MAMMOGRAPHY) to bioinformatics (PROTEIN HOMOLOGY). This
type of dataset often occurs in true data mining, which makes these results
much more practical than the ones obtained on small/balanced data.

The 0.0 scores on the ISOLET dataset (see the corresponding results table) for
sigmoid based random projections are a result of very high values (~200) of
$\langle x, w \rangle$ for all $x$, which results in $G(x, w, b) = 1$, so the
whole dataset is reduced to the singleton
$\{[1, \ldots, 1]^T\} \subset \mathbb{R}^h \subset \mathcal{H}$, which obviously
is not separable by any classifier, neither ELM nor EEM.
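The saturation effect is easy to reproduce on synthetic data. The sketch below does not use the ISOLET data: it draws uniform features of the same dimension ($d = 617$) and assumes the standard logistic form $1/(1 + \exp(-(\langle w, x \rangle + b)))$ for the sigmoid activation, which is an assumption made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 100, 617, 50
X = rng.uniform(0.0, 1.0, size=(n, d))   # synthetic stand-in, features in [0, 1]
W = rng.uniform(0.0, 1.0, size=(h, d))   # random weights in [0, 1]
b = rng.uniform(0.0, 1.0, size=h)        # random biases in [0, 1]

Z = X @ W.T + b                          # pre-activations, all far above zero
H = 1.0 / (1.0 + np.exp(-Z))             # logistic sigmoid (assumed form)

smallest_preactivation = Z.min()
all_saturated = bool(np.all(H == 1.0))   # every point maps to [1, ..., 1]
max_abs_cov = np.abs(np.cov(H, rowvar=False)).max()
```

Since every projected point is identical in double precision, both class means coincide and the covariance is zero, so no linear rule in the projected space can separate the classes.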

For other activation functions we see that EEM achieves slightly worse
results than ELM. On the other hand, scores of EEKM generally outperform the
ones obtained by ELM and are very close to the ones obtained by well tuned
SVM and LS-SVM. At the same time, EEM and EEKM were trained significantly
faster: as the training time results show, training was an order of magnitude
faster than for SVM related models and even 1.5-2x faster than for ELM. It
seems that the Ledoit-Wolf covariance estimation together with the inversion of
these matrices is simply a faster operation (scales better) than computation of
the Moore-Penrose pseudoinverse of $H^T H$. Obviously one can change the ELM
training routine to the regularized one, where instead of $(H^T H)^{-1}$ one
computes $(H^T H + I/C)^{-1}$, but we are analyzing here parameterless
approaches, while the analogous modification could be used for EEM in the form
of $(\mathrm{cov}(X^-) + \mathrm{cov}(X^+) + I/C)^{-1}$ instead of computing the
Ledoit-Wolf estimator. In other words, in the parameterless training scenario
described in this paper, EEMs seem to scale better than ELMs while still
obtaining similar classification results. At the same time EEKM obtains
SVM-level results with orders of magnitude smaller training times. Both ELM and
EEM could be transformed into regularization parameter based learning, but
this is beyond the scope of this work.
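The two ELM training variants mentioned above can be contrasted in a short sketch. The function names and the random data are illustrative assumptions; the example only shows that the regularized solution $(H^T H + I/C)^{-1} H^T T$ approaches the pseudoinverse solution for large $C$ and is shrunk towards zero for small $C$.

```python
import numpy as np

def elm_pinv(H, T):
    """Parameterless ELM output weights via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(H) @ T

def elm_ridge(H, T, C):
    """Regularized variant: (H^T H + I/C)^{-1} H^T T."""
    h = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(h) / C, H.T @ T)

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 5))                 # N = 50 samples, h = 5 hidden units
T = rng.choice([-1.0, 1.0], size=50)

beta_pinv = elm_pinv(H, T)
beta_large_C = elm_ridge(H, T, C=1e10)       # almost no regularization
beta_small_C = elm_ridge(H, T, C=1e-2)       # strong regularization
```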


7.4 Entropy based hyperparameter optimization 

Now we proceed to entropy based evaluation. Given a particular set of linear
hypotheses $\mathcal{M}$ in $\mathcal{H}$ we want to select an optimal set of
hyperparameters $\theta$ (such as the number of hidden neurons or a
regularization parameter) which identify a particular model
$\beta_\theta \in \mathcal{M} \subset \mathcal{H}$. Instead of using expensive
internal cross-validation (or another generalization error estimation technique
like $\mathrm{Err}^{0.632}$) we select such $\theta$ which maximizes our
entropic measure. In particular we consider a simplified Cauchy-Schwarz
Divergence based strategy where we select $\theta$ maximizing

$$D_{CS}\big(\mathcal{N}(\beta_\theta^T m^+, \mathrm{var}(\beta_\theta^T H^+)),\ \mathcal{N}(\beta_\theta^T m^-, \mathrm{var}(\beta_\theta^T H^-))\big),$$

and a kernel density based entropic strategy selecting $\theta$ maximizing

$$D_{CS}\big([\![\beta_\theta^T H^+]\!],\ [\![\beta_\theta^T H^-]\!]\big), \qquad (5)$$

where $[\![A]\!] = [\![A]\!]_{\sigma(A)}$ is a Gaussian KDE using Silverman's rule
of the window width

$$\sigma(A) = \left(\frac{4}{3|A|}\right)^{1/5} \mathrm{std}(A) \approx \frac{1.06}{\sqrt[5]{|A|}}\, \mathrm{std}(A).$$

This way we can use the whole given set for training and do not need to repeat
the process, as $D_{CS}$ is computed on the training set instead of the hold-out set.
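The simplified strategy has a closed form, since the integral of a product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means. The sketch below assumes the convention $D_{CS}(f, g) = \log\int f^2 + \log\int g^2 - 2\log\int fg$ (other texts use half of this value, which does not change the maximizer); the function names and the random sample are illustrative.

```python
import math
import numpy as np

def gauss_product_integral(mu1, var1, mu2, var2):
    """Integral over R of N(mu1, var1)(z) * N(mu2, var2)(z) dz."""
    s = var1 + var2
    return math.exp(-(mu1 - mu2) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)

def d_cs_gauss(mu1, var1, mu2, var2):
    """Cauchy-Schwarz divergence between two 1-D Gaussians (assumed convention)."""
    return (math.log(gauss_product_integral(mu1, var1, mu1, var1))
            + math.log(gauss_product_integral(mu2, var2, mu2, var2))
            - 2 * math.log(gauss_product_integral(mu1, var1, mu2, var2)))

def silverman_width(a):
    """Silverman's rule of thumb for the Gaussian KDE window width."""
    a = np.asarray(a, dtype=float)
    return (4.0 / (3.0 * a.size)) ** 0.2 * a.std(ddof=1)

d_same = d_cs_gauss(0.0, 1.0, 0.0, 1.0)
d_near = d_cs_gauss(0.0, 1.0, 1.0, 1.0)
d_far = d_cs_gauss(0.0, 1.0, 3.0, 1.0)

sample = np.random.default_rng(0).normal(size=100)
width = silverman_width(sample)
approx_width = 1.06 * sample.size ** -0.2 * sample.std(ddof=1)
```

Selecting $\theta$ then amounts to evaluating `d_cs_gauss` on the projected class means and variances of each candidate model and keeping the candidate with the largest value.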

First, one can notice in the results tables (including Table 10) that such an
entropic criterion works well for EEM, EEKM and Support Vector Machines. On the
other hand, it is not very well suited for ELM models. This confirms conclusions
from our previous work on classification using $D_{CS}$, where we claimed that
SVMs are conceptually similar in terms of optimization objective, and it extends
them to the new class of models (EEMs). Second, the results show that EEM and
EEKM can truly select their hyperparameters using a very simple technique
requiring no model retrainings. Computation of the simplified (Gaussian)
criterion is linear in terms of the training set size, and takes constant time
if performed using precomputed projections of the required objects (which are
either way computed during EEM training). This makes this very fast model even
more robust.


7.5 EEM stability 

It was previously reported that ELMs have very stable results in the wide
range of the number of hidden neurons. We performed analogous experiments
with EEM on UCI datasets. We trained models for 100 increasing hidden layer
sizes ($h = 5, 10, \ldots, 500$) and plotted the resulting GMean scores in Fig. 3.

Figure 3: Plot of the EEM's (with RBF activation function) GMean scores
from cross validation experiments for increasing sizes of hidden layer.

One can notice that, similarly to ELM, the proposed methods are very stable.
Once the machine gets enough neurons (around 100 in case of the tested
datasets), further increasing of the feature space dimension has a minor effect
on the generalization capabilities of the model. It is also important that some
of these datasets (like SONAR) do not even have 500 points, so there are more
dimensions in the Hilbert space than we have points to build our covariance
estimates, and even so we still do not observe any rapid overfitting.


8 Conclusions 

In this paper we have presented Extreme Entropy Machines, models derived
from information theoretic measures and applied to classification problems.
The proposed methods are strongly related to the concepts of Extreme Learning
Machines (in terms of general workflow, rapid training and randomization)
as well as Support Vector Machines (in terms of margin maximization
interpretation as well as LS-SVM duality).

Main characteristics of EEMs are:

• information theoretic background based on differential and Renyi's quadratic
entropies,

• closed form solution of the optimization problem,

• generative training, leading to direct probability estimates,

• small number of metaparameters,

• good classification results,

• rapid training that scales well to hundreds of thousands of examples and
beyond,

• theoretical and practical similarities to the large margin classifiers and
Fisher Discriminant.

The performed evaluation showed that, similarly to ELM, the proposed EEM is a
very stable model in terms of the size of the hidden layer and achieves
classification results comparable to the ones obtained by SVMs and ELMs.
Furthermore we showed that our method scales better to truly big datasets
(consisting of hundreds of thousands of examples) without sacrificing the
quality of results.

During our considerations we pointed out some open problems and issues
which are worth investigating:

• Can one construct a closed-form entropy based classifier with different
distribution families than Gaussians?

• Is there a theoretical justification of the stability of the extreme learning
techniques?

• Is it possible to further improve the achieved results by performing
unsupervised entropy based optimization in the hidden layer?